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ABSTRACT:  The  open  and  distributed  systems,  that  are  the  most  important  systems  used  in  the  field  of 
command  and  control  systems,  must  never  fail.  But  only  ideal  system  would  be  perfectly  reliable  and  never  fail.  Fault 
tolerance  is  the  best  guarantee  that  high- confidence  systems  will  not  succumb  to  physical,  design,  or  human-machine 
interaction  faults. 

A fault  tolerant  system  is  one  that  can  continue  to  operate  reliably  by  producing  acceptable  outputs  in  spite  of 
occasional  occurrences  of  component  failures. 

A fault  tolerant  cluster  is  a cluster  with  a set  of  independent  nodes,  connected  over  a network,  and  always  with 
external  storage  devices  connected  to  the  nodes  on  a common  input/output  bus.  The  cluster  software  is  a layer  that  runs 
on  top  of  local  operating  systems  running  on  each  computer.  Clients  are  connected  over  the  networks  to  a server 
application  that  is  executing  on  the  nodes.  The  nodes  of  a cluster  are  connected  in  a loosely  coupled  manner,  each 
maintaining  its  own  separate  processors,  memory,  and  operating  system.  Special  communications  protocols  and  system 
processes  bind  these  nodes  together  and  allow  them  to  cooperate  to  provide  outstanding  levels  of  availability  and 
flexibility  for  supporting  mission  critical  applications. 

One  of  the  most  important  problems  in  implementing  fault  tolerant  system  is  the  identification  of  single  points 
of  failure  and  elimination  of  these  single  points  of  failure  by  using  replaceable  units.  When  a component  becomes 
unavailable,  fault  tolerant  cluster  software  detects  the  loss  and  shifts  that  component’s  workload  to  another  component 
in  the  cluster.  The  failure  recovery  is  done  automatically,  without  any  human  intervention. 

The  need  for  interoperability  between  the  M&S  world  and  the  Command  and  Control  world  has  been 
formulated  in  several  publications.  The  challenge  even  increases  when  NATO  and  PfP  Nations  demands  to  train  using 
their  own  simulation  systems  as  well  as  their  own  command  and  control  systems.  The  key  issue  for  the  Command, 
Control,  Communications,  Computers,  Intelligence,  Surveillance,  and  Reconnaissance  (C4ISR)  community  is  the 
interoperability  between  live  or  real  C4ISR  systems  and  modeling  and  simulation  systems. 

Today,  some  of  the  architectural  key  words  that  are  in  common  in  the  modem  command  and  control  systems 
and  fault  tolerant  systems  are  the  following: 

-Open  and  distributed  systems; 

-Networks; 

-High  level  operating  systems; 

-Hierarchical  architecture: 

-Interoperability  and  reusability; 

-High  availability  systems. 


Paper  presented  at  the  RTO  1ST  Symposium  on  “New  Information  Processing  Techniques  for  Military  Systems  ”, 
held  in  Istanbul,  Turkey,  9-1 1 October  2000,  and  published  in  RTO  MP-049. 
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1 INTRODUCTION 

In  the  past,  command  and  control  systems  were  designed 
and  developed  to  implement  a constrained,  prescribed 
system  specification.  Each  organisation  built  systems  to 
meet  its  own  information  requirements,  with  little 
concern  for  satisfying  tlie  requirements  of  others,  or  for 
considering  in  advance  the  need  for  information 
exchange.  Incorporating  the  potential  for  growth  and 
change  were  not  essential  criteria.  Also,  design  of 
command  and  control  systems  w'as  driven  by  available 
technology.  Limitations  in  areas  such  as  bandwidth, 
peripherals,  and  reliability  of  the  components  did  not 
allow  the  realization  of  dishibuted  systems  that  to  be 
used  for  command  and  control. 

But,  command  and  control  systems  must  continue  to 
evolve  and  adapt  to  new  requirements  over  an  extended 
period  of  time.  The  architecture  of  these  systems  should 
be  flexible  enough  to  handle  any  component  or 
subcomponent  changes  that  are  required  to  make  the 
system,  as  a whole,  meet  all  requirements,  and  to  offer 
the  flexibility  to  support  multiple  configurations  of  the 
architecture  needed  for  specific  objectives,  and  the 
ability  to  sustain  changes  in  design  over  the  program  life 
cycle. 

Now,  advances  in  information  technology  pennit  that 
these  systems  to  be  designed  around  function  and 
flexibility,  not  hardware.  What  is  important  this  time  in 
command  and  control  systems  design  is  not  the 
technology,  but  the  information  exchange  requirements, 
the  mission  organizational  rules  and  functions. 

2 FAULT  TOLERANCE  IN  HIGH 
AVAILABILITY  CLUSTERS 

2.1  FAULT  TOLERANCE 

The  open  and  distributed  systems,  which  are  the  most 
important  systems  used  for  command  and  control,  must 
never  fail.  But  only  ideal  system  would  be  perfectly 
reliable  and  never  fail.  This,  of  course,  is  impossible  to 
be  achieved  in  practice,  because  the  systems  fail  for 
many  reasons.  Fault  tolerance  is  the  best  guarantee  that 
high-confidence  systems  will  not  succumb  to  physical, 
design,  or  human-machine  interaction  faults,  or  will 
allow  viruses  and  malicious  acts  to  disrupt  essential 
serx'ices. 

A fault  tolerant  system  is  one  that  can  continue  to 
operate  reliably  by  producing  acceptable  outputs  in  spite 
of  occasional  occurrences  of  component  failures. 

The  basic  principle  of  fault-tolerant  design  is  the  use  of 
redundancy,  and  there  are  three  basic  techniques  to 
achieve  fault  tolerance:  spatial  (redundant  hardware), 
informational  (redundant  data  structures),  and  temporal 
(redundant  computation). 


Redundancy  costs  both  money  and  time.  The  designers 
of  fault-tolerant  systems,  therefore,  must  optimize  their 
designs  by  trading  off  the  amount  of  redundancy  used 
and  the  desired  level  of  fault  tolerance.  Computational 
redundancy  usually  requires  recomputation  and  the 
typical  result  is  a slower  recovery  from  failure, 
compared  to  hardware  redundancy.  On  the  other  hand, 
hardware  redundancy  increases  hardware  costs,  weight, 
and  power  requirements. 

The  classical  hardware  and  software  fault  tolerant 
techniques  are  modular  redundancy,  N-version 
programming,  error-control  coding,  checkpoints, 
rollbacks,  and  recovery  blocks. 

2.2  REPLACEABLE  UNITS 

Modern  systems  are  partitioned  at  several  levels  based 
on  functions  provided  by  specific  subsystems.  A fault- 
tolerant  system  displays  similar  functional  partitioning, 
but  in  addition  it  contains  redundant  components  and 
recovery  mechanisms  which  may  be  employed  in 
different  ways  at  different  levels.  It  is  reasonable  to  view 
a fault-tolerant  system  as  a nested  set  of  subsystems  each 
of  which  may  display  varying  levels  of  fault  tolerance. 
Recovery  from  a fault  within  a redundant  partition  may 
be  effected  within  the  domain  itself,  or  may  require 
action  by  higher  levels  within  the  system. 

Fault  tolerant  architectures  package  these  redundant 
partitions  into  replaceable  units.  A replaceable  unit  is  a 
unit  of  failure,  replacement  and  growth  - that  is,  a unit 
that  fails  independently  of  other  units,  which  can  be 
removed  without  affecting  other  units,  and  can  be  added 
to  a system  to  augment  its  performance,  capacity,  or 
availability. 

The  desired  result  of  system  partitioning  and  subsystem 
design  is  an  integrated  set  of  local,  intermediate,  and 
global  fault  tolerance  functions  that  serve  as  a protective 
infrastructure  to  ensure  the  timely  and  correct  delivery  of 
system  services. 

2.3  CLUSTERS 

A distributed  system  is  a collection  of  computers  (called 
nodes)  that  communicate  with  each  other  through  a 
communication  medium.  Under  the  control  of  systems 
software,  the  nodes  can  co-operatively  carry  out  a task. 
An  open  system  allows  system  integration,  so  the 
customers  can  choose  various  hardware  and  software 
components  from  different  vendors  and  integrate  them  to 
create  a custom  configuration  suiting  their  needs  and 
cost  requirements. 

A cluster  is  a set  of  loosely  coupled,  independent 
computer  systems,  connected  over  a network  that  behave 
as  a single  system.  The  cluster  software  is  a layer  that 
runs  on  top  of  local  operating  systems  running  on  each 
computer.  Client  applications  interact  with  a cluster  as  if 
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it  is  a single  high-performance,  highly  reliable  server.  connected  over  the  networks  to  a server  application  that 

System  managers  view  a cluster  much  as  they  see  a is  executing  on  the  nodes.  The  nodes  of  a cluster  are 

single  server.  Most  applications  will  run  on  a cluster  coimected  in  a loosely  coupled  manner,  each 

without  any  modification  at  all.  And  only  standard-based  maintaining  its  own  separate  processors,  memory,  and 

hardware  components  such  as  SCSI  disks  and  Ethernet  operating  system.  Special  communications  protocols  and 

LANs  are  used  to  create  a cluster.  system  processes  bind  these  nodes  together  and  allow 

them  to  cooperate  to  provide  outstanding  levels  of 
Clustering  can  take  many  forms.  A cluster  may  be  availability  and  flexibility  for  supporting  mission-critical 

nothing  more  than  a set  of  standard  personal  computers  applications.  Fault  tolerant  clusters  maintain  strict 

interconnected  by  Ethernet.  At  the  other  end  of  the  compliance  to  the  principles  of  open  systems.  There  are 

spectrum,  the  hardware  structure  may  consist  of  high-  no  proprietary  application  programming  interfaces  that 

performance  symmetric  multiprocessor  systems  force  vendor  lock-in  and  require  substantial  development 

connected  via  a high-performance  communications  and  investment.  Most  applications  will  run  on  a fault  tolerant 

I/O  bus.  In  both  cases,  processing  power  can  be  cluster  without  any  modification  at  all. 

increased  in  small  incremental  steps  by  adding  another 

commodity  system.  The  top-level  software  of  a fault  tolerant  cluster  can  be 

designed  to  maximize  the  flexibility  of  configurations 
If  one  system  in  a cluster  fails,  its  workload  can  be  within  a local  cluster.  Clusters  may  be  formed  with  a 

automatically  dispersed  among  the  remaining  systems.  different  number  of  nodes.  This  flexibility  in  system 

This  transfer  is  frequently  transparent  to  the  client.  selection  and  cluster  configuration  protects  customer 

investments  in  installed  systems  and  allows  the 
2.4  FAULT  TOLERANT  CLUSTERS  processing  power  of  each  node  to  be  matched  with  the 

specific  requirements  of  each  application  service. 

A fault  tolerant  cluster  is  a cluster  with  a set  of 

independent  nodes,  connected  over  a network,  and  Fig.  1 shows  a sample  configuration  for  a fault  tolerant 

always  with  external  storage  devices  connected  to  the  cluster, 

nodes  on  a common  input'output  bus.  Clients  are 


Fig.  1 A sample  configuration  for  a fault  tolerant  cluster 
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If  the  failure  of  any  component  in  a cluster  results  in  the 
unavailability  of  service  to  the  end  user,  this  component 
is  called  a single  point  of  failure  for  the  cluster.  One  of 
the  most  important  problems  in  implementing  fault 
tolerant  system  is  the  identification  of  single  points  of 
failure  and  elimination  of  these  single  points  of  failure 
by  using  replaceable  units. 

The  elimination  of  a single  point  of  failure,  by  using 
replaceable  units,  always  has  a cost  associated  witli  it. 
Usually,  what  can  be  done  is  only  to  attempt  to  make  a 
service  highly  available  if  the  cost  of  losing  the  service 
is  greater  than  the  cost  of  protecting  it. 

The  possible  single  points  of  failure  that  a cluster  could 
have  are: 

-Nodes  in  the  cluster, 

-Disks  used  to  store  application  or  data,  adapters, 
controllers  and  cables  used  to  connect  the  nodes  to  the 
disks, 

-The  network  backbones  over  which  the  user  are 
accessing  the  cluster  nodes  and  network  adapters 
attached  to  each  node, 

-Power  sources, 

-Applications. 

A high  availability  cluster  is  a grouping  of  serv'ers 
having  sufficient  redundancy  of  software  and  hardware 
components  that  a failure  will  not  disrapt  the  availability 
of  computer  services.  To  develop  a complete  high 
availability  solution,  is  necessary  to  be  maintained  high 
availability  within  a hierarchy  of  system  levels,  some  of 
which  go  beyond  the  cluster  level.  Failure  at  all  levels 
must  be  detected  quickly  and  a fast  response  provided. 
When  a component  becomes  unavailable,  fault  tolerant 
cluster  software  detects  the  loss  and  shifts  that 
component’s  workload  to  another  component  in  the 
cluster.  The  failure  recoveiy  is  done  automatically, 
without  any  human  intervention.  At  the  same  time, 
planned  maintenance  events  at  all  levels  must  be 
possible  with  minimum  disruption  of  service. 

2.5  ELIMINATING  NODES  AS  SINGLE  POINTS 
OF  FAILURE 

The  node  in  a fault  tolerant  system  consists  of  a group  of 
components,  any  of  which  can  fail.  The  most  important 
components  are: 

-One  or  more  central  processing  units, 

-Memory  boards, 

-Inpul/output  controllers. 

The  use  of  cluster  architecture  lets  the  system  eliminate  a 
node  as  a single  point  of  failure  without  losing  service. 

The  nodes  are  connected  to  each  other  by  a local  area 
network,  which  allows  them  to  accept  client  connections 
and  to  transmit  messages  that  confirm  each  other’s 
health.  If  one  node  fail,  the  failured  node  is  removed 


from  the  cluster  and,  after  only  a brief  delay,  its 
resources  are  taken  over  by  the  node  configured  to  do  so, 
so  called  the  takeover  node.  This  process  is  known  as 
failover.  The  process  of  failover  is  handled  by  special 
high  availability  software  running  on  all  nodes  in  the 
cluster.  Different  types  of  clusters  use  different  cluster 
management  and  failover  techniques.  There  are  specific 
differences  in  cluster  types  and  their  high  availability 
software. 

In  fault  tolerant  clusters,  disks  containing  data  are 
physically  connected  to  multiple  nodes  on  a common 
inpuftoutput  bus.  When  a node  that  owns  a disk  fail,  a 
surviving  node  assumes  control  of  the  disk,  so  that  the 
critical  data  remains  available. 

Clients  and  other  devices  are  connected  over  the 
networks  to  the  nodes.  After  the  failover,  all  the  clients 
and  network  devices  connected  to  the  failed  node  can 
access  the  second  node  as  easily  as  the  first.  When  a 
node  failed  during  the  mnning  of  a critical  application,  a 
takeover  node  can  restart  that  application  so  that  the 
service  is  not  lost. 

2.6  ELIMINATING  DISKS  AS  SINGLE  POINTS 
OF  FAILURE 

A fault  tolerant  solution  improves  data  availability  by 
allowing  that  a number  of  nodes  to  share  the  same  hard 
disks  within  a cluster.  When  a node  in  the  cluster  fails, 
the  fault  tolerant  cluster  software  will  recover  and 
disperse  the  w'ork  from  the  failed  node  to  another  node 
within  the  cluster.  As  a result,  the  failure  of  a system  in 
the  cluster  will  not  affect  the  other  systems,  and  in  most 
cases,  the  client  applications  will  be  completely  unaware 
of  the  failure. 

Each  node  in  a cluster  has  its  own  root  disks,  but  each 
node  may  also  be  physically  connected  to  several  other 
disks  in  such  a way  that  multiple  nodes  can  access  the 
same  data.  On  such  systems,  this  cluster-oriented  access 
is  provided  by  a software  cluster  component  called 
Logical  Volume  Manager.  Access  may  be  exclusive  or 
shared,  depending  on  the  kind  of  cluster  created. 
Redundancy  is  necessary  to  prevent  the  failure  of  disk 
media  or  a disk  controller. 

Different  fault  tolerant  configurations  provide  a range  of 
solutions  that  address  varying  levels  of  fault  protection 
requirements,  including  multi-site  resiliency  solutions. 
There  are  solutions  that  combine  local  fault  tolerant 
cluster  configurations  with  Fibre  Channel  mass-storage 
devices  in  order  to  provide  a disaster-tolerant  solution 
for  a clustering  environment  up  to  40  kilometers  apart. 

The  most  important  two  methods  available  for  providing 
disk  redundancy  are:  using  disk  arrays  in  a RAID 
configuration  and  using  software  mirroring.  Each 
approach  has  its  own  advantages. 
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RAID  (Redundant  Array  of  Inexpensive  Disk)  is  a disk 
technology  that  is  designed  to  provide  improved 
availability,  security  and  perfomiance  over  conventional 
disk  systems.  While  appearing  logically  to  the  operating 
system  as  a single  disk  drive,  a RAID  array  is  actually 
made  up  of  several  disks,  which  have  their  data  spread 
across  the  drives  in  any  of  several  different  methods. 
The  group  of  disks  that  function  together  in  a well- 
defined  arrangement  is  known  as  RAID  level.  RAID 
Level  1 allows  hardware  mirroring,  while  otliers  provide 
protection  through  the  use  of  parity  data.  The  RAID 
levels  allow  the  array  to  reconstruct  lost  data  if  a disk 
fails. 

In  addition,  arrays  can  be  configured  in  independent 
mode,  w'hich  means  that  each  member  of  the  array  is 
seen  as  an  independent  disk. 

Some  of  the  advantages  of  using  disk  arrays  for 
protected  data  storage  are  as  follows: 

-Data  redundancy. 

-Redundant  power  supplies  and  cooling  fans. 

-Online  maintenance. 

-Highest  storage  connectivity. 

-Flexibility  in  configuration  (different  modes  available). 
-On  some  devices,  dual  controllers  can  eliminate 
additional  single  points  of  failure. 

An  alternative  technique  for  providing  protected  data 
storage  is  the  use  of  software  mirroring,  which  allows 
that  a single  logical  filesystem  to  be  implemented  on 
multiple  physical  copies  in  a way  that  is  transparent  to 
users  and  applications.  It  means  that  if  a disk  or  sectors 
of  a disk,  containing  a copy  of  the  data,  should  fail  the 
data  will  still  be  accessible  from  another  copy  on  another 
disk.  Note  that  the  mirror  copy  is  on  a separate 
input/output  bus.  This  arrangement  eliminates  the  disk, 
the  input/output  card  and  the  bus  as  single  points  of 
failure. 

2.7  ELIMINATING  NETWORKS  AS  SINGLE 
POINTS  OF  FAILURE 

Networks  are  configured  and  used  in  clustered  systems 
for  access  to  an  application  by  clients  or  other  systems, 
and  for  communication  between  cluster  nodes.  In  a fault 
tolerant  cluster,  the  software  establishes  a 
communieation  link  knows  as  a heartbeat.  It  is 
recommended  that  the  fault  tolerant  cluster  to  be 
designed  with  more  than  one  network,  so  that  high  level 
cluster  software  has  at  least  one  network  at  all  times  that 
it  can  use  to  monitor  the  status  of  cluster  nodes. 

This  special  use  of  networking  must  itself  be  protected 
against  failures.  Points  of  failure  in  the  network  include 
the  LAN  interfaces  and  cables  connected  to  each  node. 
In  the  cluster  the  entire  communication  link  from  the 
client  system  to  the  application  server  is  subject  to 
failures  of  various  kinds.  Depending  on  the  type  of  LAN 


hardware,  failures  may  occur  in  cables,  interface  cards, 
network  routers,  hubs,  or  concentrators. 

All  these  single  points  of  network  failure  can  be 
eliminated  by  providing  fully  redundant  LAN 
connections,  and  by  configuring  local  switching  of  LAN 
interfaces.  To  protect  against  network  adapter  failure,  a 
second  network  adapter  would  be  configured  to  the  same 
network  backbone.  If  the  fault  tolerant  cluster  is 
designed  with  more  tlian  one  network,  two  network 
adapters  will  be  used  for  all  the  network  backbones.  For 
eliminating  the  loss  of  client  connectivity,  can  also  be 
configured  redundant  routers  or  redundant  hubs  through 
which  clients  can  access  the  services  of  the  cluster. 
Another  way  to  eliminate  points  of  failure  is  to  configure 
local  switching,  which  means  shifting  from  a configured 
LAN  interface  card  to  a standby. 

2.8  ELIMINATING  POWER  SOURCES  AS 
SINGLE  POINTS  OF  FAILURE 

Different  methods  can  be  used  for  eliminating  power 
sources  as  single  points  of  failure.  The  use  of  multiple 
power  circuits  with  different  circuit  breakers  reduces  the 
likelihood  of  a complete  power  outage.  An 
uninterruptible  power  supply  provides  standby  in  the 
event  of  an  interruption  to  the  power  source.  Small  local 
uninterruptible  power  supply  can  be  used  to  protect 
individual  system  processor  units  and  data  disks.  Large 
power  passthrough  units  can  protect  the  power  supply  to 
an  entire  computer  system. 

2.9  ELIMINATING  APPLICATIONS  AS  SINGLE 
POINTS  OF  FAILURE 

The  software  of  a fault  tolerant  cluster  is  a layer  that  runs 
on  top  of  local  operating  systems  running  on  each 
computer.  The  cluster  management  software  provides 
services  like  as  failure  detection,  recovery,  load 
balancing,  and  the  ability  to  manage  the  servers  as  a 
single  system.  This  high  level  software  monitors  local 
hardware  and  software  subsystems,  tracks  the  states  of 
the  nodes,  and  quickly  responds  to  failures  in  a way  that 
eliminates  or  minimizes  applications  downtime,  and 
provides  a number  of  important  other  benefits,  including 
improved  availability,  easier  manageability,  and  cost- 
effective  scalability. 

The  critical  applications  and  data  are  housed  on  disk 
devices  that  are  physically  cabled  to  cluster  nodes.  This 
shared  physical  connection  allows  the  ownership  of 
shared  logical  volumes  and  their  contents  to  be  quickly 
switched  from  one  node  to  another.  Load  balancing 
technique  allow's  the  performance  of  a server-based 
program  to  be  scaled  by  distributing  its  client  requests 
across  multiple  servers  within  the  fault  tolerant  cluster. 
The  load  balancing  management  softw'are  can  specify 
the  load  percentage  that  it  will  handle,  or  the  load  can  be 
equally  distributed  across  all  of  the  hosts.  If  a host  fails, 
the  load  balancing  mechanism  dynamically  redistributes 
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the  load  among  the  remaining  hosts.  Load  balancing 
technique  is  used  to  enhance  scalability,  which  boost 
throughput  while  keeping  response  times  low. 

The  high  level  software  of  a fault  tolerant  cluster  operates 
in  a fully  transparent  manner  to  both  server  applications 
and  to  TCP/IP  clients.  It  lets  users  employ  off-the-shelf 
software  components,  such  as  existing  WWW,  FTP,  or 
proxy  servers  and  other  popular  Internet  applications, 
and  enhances  fault-tolerance  and  scales  performance 
transparently  to  the  TCP/IP  protocol,  to  server 
applications,  and  to  clients. 

When  a host  fails  or  goes  offline,  the  high  level  software 
of  a fault  tolerant  cluster  automatically  reconfigures  the 
cluster  to  direct  client  requests  to  the  remaining 
computers.  In  addition,  for  load-balanced  ports,  the  load 
is  automatically  redistributed  among  the  computers  still 
operating,  and  ports  with  a single  server  have  their 
traffic  redirected  to  a specific  host.  While  connections  to 
the  failed  or  offline  server  are  lost,  once  the  necessary 
maintenance  is  completed,  the  offline  computer  can 
transparently  rejoin  the  cluster  and  regain  its  share  of  the 
workload.  This  robust  fault  tolerance  is  enabled  by  a 
unique  distributed  architecture,  which  avoids  the  single 
points  of  failure  or  performance  bottlenecks  of  other 
load  balancing  solutions. 

If  there  is  a node  failure,  it  shuts  down,  and  the  cluster 
reconfigures  itself;  services  that  were  on  the  failed  node 
are  made  available  on  another  system.  There  are 
different  methods  used  for  providing  services  after  the 
shutting  down  of  a node. 

One  way  is  to  have  another  node  that  take  over  the 
applications  that  were  mnning  on  the  failed  system.  By 
using  the  high-level  cluster  software,  application 
sendees  and  all  the  resources  needed  to  support  the 
application  can  be  putted  together  into  special  entities 
called  application  packages.  This  application  packages 
are  the  basic  units  that  are  managed  and  moved  within 
the  fault  tolerant  cluster.  Packages  simplify  the  creation 
and  management  of  highly  available  services  and 
provide  outstanding  levels  of  flexibility  for  workload 
balancing.  When  a package  is  failed  over  between  nodes, 
all  of  the  contents  of  the  package  are  moved  from  the 
failed  node  to  a new  node.  The  ability  to  easily  move 
application  packages  within  a fault  tolerant  cluster 
provide  outstanding  availability  during  system 
maintenance  activities  such  as  hardware  or  software 
upgrades.  Packages  can  be  moved  from  node  to  node 
with  simple  operator  commands,  allowing  scheduled 
maintenance  to  be  performed  on  one  node  of  a cluster 
while  other  nodes  continue  to  provide  support  for  critical 
applications.  When  the  maintenance  is  complete,  the 
node  rejoins  the  cluster  and  assumes  its  normal  workload 
of  application  packages.  The  same  method  can  also  be 
used  to  perform  rolling  operating  system  upgrades  across 
a cluster. 


Another  approach  for  providing  services  after  the 
shutting  down  of  a node  is  to  provide  different  instances 
of  the  same  application  running  on  multiple  nodes  so 
that  when  one  node  goes  down,  users  need  only 
reconnect  to  an  alternate  node. 

In  both  cases,  the  use  of  clusters  makes  recovery  from 
failure  possible  in  a reasonably  short  time. 

3 MODELING  AND  SIMULATION 
SYSTEMS  AND  COMMAND,  CONTROL, 
COMMUNICATIONS,  COMPUTERS, 
INTELLIGENCE,  SURVEILANCE,  AND 
RECONNAISSANCE  SYSTEMS 

3.1  INTEROPERABILITY 

Warfighter  battlespace  is  complex  and  dynamic,  and 
informalion  related  to  the  battlespace  must  flow'  quickly 
among  all  tactical,  strategic,  and  supporting  elements. 
There  is  an  unprecedented  increase  in  the  amount  of  data 
and  information  necessary  to  conduct  operational 
planning  and  combat  decision-making.  The  ability  of  the 
command,  control,  communications,  computers, 
intelligence,  surveillance,  and  reconnaissance  systems 
supporting  these  operations  to  interoperate — work 
together  and  exchange  information — is  critical  to  their 
success. 

Several  elements  are  common  to  almost  all  systems  used 
for  modeling  and  simulation.  According  with  the  ideas 
of  the  NATO  “Modelling  and  Simulation  Master  Plan” 
[8],  simulation  systems  are  used  in  the  folio wdng 
application  domains; 

-Analyses; 

-Simulation  based  Acquisition; 

-Training  and  Exercises; 

-Decision  Support. 

The  need  for  interoperability  between  the  Modeling  and 
Simulation  (M&S)  world  and  the  Command,  Control, 
Communications,  Computers,  Intelligence,  Surveillance, 
and  Reconnaissance  (C4ISR)  w'orld  has  been  formulated 
in  several  publications.  The  challenge  even  increases 
when  NATO  and  PfP  Nations  demands  to  train  using 
their  own  simulation  systems  as  well  as  their  owm 
command  and  control  systems.  The  key  issue  for  the 
C4ISR  community  is  the  interoperability  between  live  or 
real  C4ISR  systems  and  M&S  systems. 

According  to  Joint  Publication  1-02,  DoD  Dictionary  of 
Military  and  Related  Terms,  interoperability  is  defined 
in  two  context  as  follows: 

(1)  “The  ability  of  systems,  units,  or  forces  to  provide 
services  to  and  accept  services  from  other  systems,  units, 
or  forces,  and  to  use  the  services  so  exchanged  to  enable 
them  to  operate  effectively  together”; 

(2)  ‘The  condition  achieved  among  communications- 
electronic  systems  or  items  of  communications- 
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electronic  equipment  when  information  and  services  can 
be  exchanged  directly  and  satisfactorily  between  them 
and/or  their  users.  The  degree  of  interoperability  should 
be  defined  when  referring  to  specific  cases” 

Within  the  simulation  community,  the  new  and 
promising  approach  of  using  “High  Level  Architecture” 
and  “Synthetic  Environment  Data  Representation  and 
Interchange  Specification”  is  promoted  to  gain 
interoperability  and  reuse  of  the  systems.  For  realising 
the  interoperability,  the  C4ISR  community  is  moving  to 
standardize  the  Joint  Technical  Architecture  (JTA)  and 
the  Defense  Information  Infrastructure  Common 
Operating  Environment  (DII  COE).  Unfortunately,  over 
the  last  decade,  uncoordinated  standards  for  M&S-to- 
C4ISR  interoperability  have  been  and  are  currently  being 
developed  by  both  communities. 

3.2  HIGH  LEVEL  ARCHITECTURE  AND 
SYNTHETIC  ENVIRONMENT  DATA 

REPRESENTATION  AND  INTERCHANGE 
SPECIFICATION 

The  intention  of  the  High  Level  Architecture  (HLA)  and 
Synthetic  Environment  Data  Representation  and 
Interchange  Specification  (SEDRIS)  is  to  provide  a 
common  architecture  for  modeling  and  simulation,  and 
to  offer  a structure  that  will  support  the  reuse  and 
interoperability  of  simulations. 

The  High  Level  Architecture  comprises  three 
components:  the  rules,  the  interface  specification,  and 
the  object  model  template. 

-High  Level  Architecture  Rules  are  a set  of  rules  that 
must  be  followed  to  achieve  proper  interaction  of 
federates  during  a federation  execution.  These  describe 
the  responsibilities  of  federates  and  of  the  Runtime 
Infrastracture  in  High  Level  Architecture  federations. 
-The  Interface  Specification  defines  the  standard 
sers'ices  and  interfaces  to  be  used  by  federates  in  order  to 
support  efficient  information  exchange  when 
participating  in  a distributed  federation  execution  and  the 
reuse  of  the  individual  federates. 

-The  Object  Model  Templates  prescribes  the  format  and 
syntax  for  recording  the  information  in  High  Level 
Architecture  object  models,  for  each  federation  and 
federate. 

The  basic  components  of  the  High  Level  Architecture 
are  the  simulations  themselves,  or  more  generally,  the 
federates.  The  High  Level  Architecture  requires  that  all 
federates  incoiporate  specified  capabilities  to  allow  the 
objects  in  the  simulation  to  interact  wdth  objects  in  other 
simulations  through  the  exchange  of  data  supported  by 
services  implemented  in  the  Runtime  Infrastructure.  The 
Runtime  Infrastructure  is  a distributed  operating  system 
for  the  federation  which  provide  a set  of  general  purpose 
services  that  support  federate-to-federate  interactions 
and  federation  management  and  support  functions. 


The  Synthetic  Environment  Data  Representation  and 
Interchange  Specification  objectives  are  to: 

-Articulate  and  capture  the  complete  set  of  data  elements 
and  associated  relationships  needed  to  fully  represent  the 
physical  environment. 

-Support  the  full  range  of  simulation  applications  (e.g., 
computer-generated  forces,  manned,  visual,  and  sensor 
systems)  across  all  environmental  domains  (terrain, 
ocean,  atmosphere,  and  space). 

-Provide  a standard  interchange  mechanism  to  pre- 
distribute environmental  data  (from  primary  source  data 
providers  and  existing  resource  repositories)  and 
promote  data  base  reuse  and  interoperability  among 
heterogeneous  simulations. 

3.3  JOINT  TECHNICAL  ARCHITECTURE 

The  role  of  Joint  Technical  Architecture  (JTA)  is  that  of 
providing  the  foundation  for  interoperability  among  all 
tactical,  strategic,  and  combat  support  systems.  The  JTA 
provides  the  minimum  set  of  standards  that,  when 
implemented,  facilitates  the  flow  of  information  in 
support  of  the  warfighter. 

According  to  [10]  the  Joint  Technical  Architecture  must 
be: 

-A  distributed  information-processing  environment  in 
which  applications  are  integrated. 

-Applications  and  data  independent  of  hardware  to 
achieve  true  integration. 

-Information-transfer  capabilities  to  ensure  seamless 
communications  w'ithin  and  across  diverse  media. 
-Information  in  a common  format  with  a common 
meaning. 

-Common  human-computer  interfaces  for  users,  and 
effective  means  to  protect  the  information. 

Joint  Technical  Architecture  wants  to  reach  a consensus 
between  a working  set  of  standards  and  to  establish  a 
single,  unifying  technical  architecture  that  will  become 
binding  on  all  future  C4I  acquisitions  so  that  new 
systems  can  be  born  joint  and  interoperable,  and  existing 
systems  will  have  a baseline  to  move  tow'ards 
interoperability. 

3.4  INTEROPERABILITY  BETWEEN  M&S 
SYSTEMS  AND  C4ISR  SYSTEMS 

A key  task  for  the  M&S  community  is  to  link  wdth  live 
or  real  C4ISR  systems.  Within  the  C4ISR  community 
there  is  a similar  pressing  need  to  link  C4ISR  equipment 
with  simulations.  Recent  work  promises  the  cooperation 
of  these  communities  in  developing  a unified  approach 
to  linking  simulations  and  C4ISR  systems. 

According  to  [11]  in  the  near  term  simulation  control  is 
basically  one-way  with  the  simulations  initializing  the 
real  C4I  system  databases.  In  the  mid  term,  it  can  be 
expected  to  see  the  HLA  linking  constructive  and  virtual 
simulations  on  the  simulation  side  and,  via  common 
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components  found  in  C4ISR  systems,  the  HLA  also 
allowing  simulations  and  C4ISR  systems  to  exchange 
both  data  and  messages.  Simulation  initialization  will  be 
two-way  with  real  system  databases  providing 
information  to  the  simulation  side.  Ultimately,  it  can  be 
expected  to  have  full  two-way  via  common  databases, 
thus  achieving  a measure  of  seamless  interoperability. 
Finally,  an  interoperable  M&S  and  C4ISR  architecture  is 
based  on  a common  conceptual  reference  model 
accommodating  common  mediation  techniques  and 
shared  data  and  object  models,  all  linked  via  a common 
information  management  process  providing  common 
solutions  for  the  C4ISR  and  simulation  community. 

4 MODELING  AND  SIMULATION 
SYSTEMS,  COMMAND  AND  CONTROL 
SYSTEMS,  AND  FAULT  TOLERANT 
CLUSTERS 

Today,  can  be  identified  some  key  words  that  are  in 
common  in  modem  modeling  and  simulation  systems, 
command  and  control  systems  and  in  fault  tolerant 
clusters.  Some  of  them  can  be: 

-Open  and  distributed  systems; 

-Networks; 

-High  level  operating  systems; 

-Hierarchical  architecture; 

-Interoperability  and  reusability; 

-High  availability  systems. 

4.1  OPEN  AND  DISTRIBUTED  SYSTEMS 

All  modern  systems  used  for  command  and  control  must 
be  open  and  distributed  systems.  They  must  be  flexible 
and  extensible,  able  to  be  kept  up-to-date  with  state-of- 
the-art  technology,  and  to  offer  the  best  capabilities  for 
reuse  and  interoperability.  The  architecture  of  all  modem 
fault  tolerant  systems  is  that  of  a cluster,  which  is  one  of 
the  best  open  and  distributed  system. 

4.2  NETWORS 

A fault  tolerant  cluster  is  a set  of  independent  computers 
(nodes)  connected  over  a network,  and  always  with 
external  storage  devices  connected  to  the  nodes  on  a 
common  input/output  bus.  Clients  are  connected  over  the 
networks  to  a server  application  that  is  executing  on  the 
nodes.  The  nodes  of  a cluster  are  connected  in  a loosely 
coupled  manner,  each  maintaining  its  own  separate 
processors,  memory,  and  operating  system.  Special 
communications  protocols  and  system  processes  bind 
these  nodes  together  and  allow  them  to  cooperate  to 
provide  outstanding  levels  of  availability  and  flexibility 
for  supporting  mission-critical  applications. 

The  basic  High  Level  Architecture  protocol  establishes 
that  the  communications  path  between  any  federates  is 
over  the  network.  There  are  no  opportunities  for  back- 
channel  data  paths  to  corrupt  the  pruity  of  the 


architecture.  This  rigor  requires  substantial  effort  to 
design  the  models  in  the  federate  and  the  common 
functions  of  High  Level  Architecture  for  each  federate 
the  interface  data  stracture  and  the  message  transactions 
or  services  required  for  this  highly  object-oriented 
architecture.  The  resulting  architecture,  however,  offers 
the  flexibility  to  support  multiple  configurations  of  the 
architecture  needed  for  specific  modeling  and 
simulation,  and  command  and  control  objectives,  and  the 
ability  to  sustain  changes  in  design  over  the  program  life 
cycle. 

4.3  HIGH  LEVEL  OPERATING  SYSTEMS 

In  a fault  tolerant  system  the  nodes  of  the  cluster  are 
connected  in  a loosely  coupled  manner,  each 
maintaining  its  own  operating  system.  The  cluster 
software  is  a layer  that  runs  on  top  of  local  operating 
systems  running  on  each  computer.  The  high  availability 
applications  in  the  fault  tolerant  cluster  run  at  the  top 
level  cluster  software. 

In  the  High  Level  Architecture  the  Runtime 
Infrastructure  is  defined  as  a distributed  operating 
system  for  federates  and  federations. 

The  top-level  cluster  software  can  be  a good  support  for 
Runtime  Infrastructure,  and  for  the  command  and 
control  systems. 

4.4  HIER.4RCHICAL  ARCHITECTURE 

A complex  simulation  or  command  and  control  centers 
must  be  considered  as  a hierarchy  of  components  of 
increasing  levels  of  aggregation.  At  the  lowest  level  is 
the  model  of  a system  component.  This  may  be  a 
mathematical  model,  a discrete-event  queuing  model,  a 
rule-based  model,  etc. 

In  HLA  the  model  is  implemented  in  software  to 
produce  a simulation.  When  this  simulation  is 
implemented  as  part  of  an  HLA-compliant  simulation,  it 
is  referred  to  as  a federate.  HLA  simulations  are  made  up 
of  a number  of  HLA  federates  and  are  called  federations. 
Simulations  that  use  the  HLA  are  modular  in  nature 
allowing  federates  to  join  and  resign  from  the  federation 
as  the  simulation  executes. 

Based  on  functions  provided  by  specific  subsystems,  all 
fault-tolerant  clusters  are  partitioned  at  several  levels, 
but  in  addition  it  contains  redundant  components  and 
recovery  mechanisms  which  may  be  employed  in 
different  ways  at  different  levels.  It  is  reasonable  to  view 
fault-tolerant  clusters  as  a nested  set  of  subsystems,  each 
of  which  may  display  varying  levels  of  fault  tolerance. 
Recovery  from  a fault  within  a redundant  partition  may 
be  effected  within  the  domain  itself,  or  may  require 
action  by  higher  levels  within  the  system. 
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At  top  of  an}'  fault  tolerant  cluster,  command  and 
control,  and  High  Level  Architecture  compliant  system 
there  is  a distributed  operating  system  that  runs  on  top  of 
local  operating  systems  running  on  each  computer  or  on 
top  of  federates  and  federations. 

4.5  INTEROPERABILITY  AND  REUSABILITY 

More  work  is  needed  today  for  being  in  the  happy  case 
of  buying  hardware  and  software  components  from 
various  vendors,  integrating  them  in  applications  to  form 
complete  systems.  There  is  a need  for  software 
components  to  be  able  to  communicate  with  each  other 
using  “standard”  mechanisms  and  “open”  interfaces  for 
an  effective  integration  to  occur.  As  software  and 
hardware  systems  get  more  comple.x,  the  need  for 
interoperability  among  different  components  becomes 
critical. 

The  High  Level  Architecture  can  be  conceptual 
“software  bus”  that  allow  applications  to  communicate 
with  one  another,  regardless  of  who  designed  them,  the 
platform  they  are  running  on,  the  language  they  are 
wiitten  in,  and  where  they  are  running.  High  Level 
Architecture  also  enables  the  building  of  a plug-and-play 
component  software  environment. 

The  fault  tolerant  cluster  can  offer  a good  architecture 
for  command  and  control  systems  and  High  Level 
Architecture  compliant  systems  to  work  with  these 
applications.  The  fault  tolerant  cluster  is  an  open  and 
distributed  system,  flexible  and  extensible,  and  its 
architecture  offer  compliance  to  the  principle  of 
reusability  and  interoperability. 

4.6  HIGH  AVAILABILITY  SYSTEMS 

The  military  systems  must  not  succumb  to  different 
faults  and  must  continue  to  operate  reliably  in  spite  of 
occasional  occurrences  of  component  failures.  A fault 
tolerant  solution  improves  data  and  applications 
availability  by  allowing  that  a number  of  nodes  to  share 
the  same  hard  disks  within  a cluster.  When  a node  in  the 
cluster  fails,  the  fault  tolerant  cluster  software  will 
recover  and  disperse  the  work  from  the  failed  node  to 
another  node  within  the  cluster.  As  a result,  the  failure  of 
a system  in  the  cluster  will  not  affect  the  other  systems, 
and  in  most  cases,  the  client  applications  and  data  will  be 
completely  unaw'are  of  the  failure. 

In  command,  control,  communications,  computers, 
intelligence,  surveillance,  and  reconnaissance  systems 
information  related  to  the  battlcspace  is  complex  and 
dynamic  and  must  flow  quickly  among  all  tactical, 
strategic,  and  supporting  elements.  There  is  an 
unprecedented  increase  in  the  amount  of  information 
necessary  to  conduct  operational  planning  and  combat 
decision-making.  For  the  command  and  common 
systems,  and  for  the  modeling  and  simulation  systems 


high  availability  of  data  and  applications  is  very 
important. 

Fault  tolerance  is  the  best  guarantee  that  the  systems  will 
be  available,  and  the  essential  services  will  be  offered  in 
real-time  to  the  users.  The  modem  fault  tolerant  clusters 
are  able  to  eliminate  all  single  points  of  failure  in  the 
nodes  of  the  cluster,  the  disk  used  to  store  applications  or 
data,  the  networks,  the  power  sources,  the  data,  and  the 
applications  and  to  offer  the  best  high  availability 
architecture  for  command  and  control  systems,  and 
modeling  and  simulation  systems. 
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